Patent abstract:
The present invention relates to a method for managing a computing task on a functionally asymmetric multi-core processor, wherein at least one core of said processor is associated with one or more hardware extensions, the method comprising the steps of receiving a computation task associated with instructions executable by a hardware extension; receiving calibration data associated with the hardware extension; and determining an opportunity cost of performing the computing task based on the calibration data. Developments describe the determination of the calibration data, in particular by counting or by calculation (online and/or offline) of the classes of instructions executed, the execution of a predefined set of instructions representative of the execution space of the extension, the consideration of energy and temperature aspects, the translation or emulation of instructions, and the placement of calculation tasks on different cores. System and software aspects are described.
Publication number: FR3035243A1
Application number: FR1553478
Filing date: 2015-04-20
Publication date: 2016-10-21
Inventors: Alexandre Aminot; Yves Lhuillier; Andrea Castagnetti; Alain Chateigner; Henri-Pierre Charles
Applicants: Commissariat à l'Energie Atomique (CEA); Commissariat à l'Energie Atomique et aux Energies Alternatives (CEA)
IPC main classification:
Patent description:

[0001] FIELD OF THE INVENTION The invention relates to multi-core processors in general, and to the management of the placement of computation tasks on functionally asymmetric multi-core processors in particular.
[0002] STATE OF THE ART A multi-core processor may comprise one or more hardware extensions intended to accelerate very specific parts of software code that are difficult to parallelize. For example, these hardware extensions may include circuits for floating-point computation or vector computation. A multi-core processor is said to be "functionally asymmetric" when certain extensions are absent from some processor cores, i.e. when at least one core of the multi-core processor does not have the hardware extension required to execute a given instruction set. There is a common instruction set shared by all cores and specific instruction sets that are only executable by certain predefined cores. The union of the instruction sets of the cores composing the multi-core processor covers all the instructions (of the application). A functionally asymmetric processor can thus be characterized by an unequal distribution (or association) of extensions to processor cores.
[0003] The management of a functionally asymmetric multi-core processor poses several technical problems. One of these is to effectively manage the placement of computing tasks on the different processor cores. Software applications use these hardware extensions in a dynamic way, that is to say in a way that varies over time. Within the same application, some calculation phases will use an extension at almost full load (e.g. calculations on floating-point data) while other phases will use it little or not at all (e.g. calculations on integer data). Using an extension is therefore not always effective in terms of performance or energy ("quality" of use).
[0004] The published work concerning the placement of computational tasks on functionally asymmetric multi-core processors does not describe satisfactory solutions.
[0005] Patent document WO2013101139 entitled "PROVIDING AN ASYMMETRIC MULTICORE PROCESSOR SYSTEM TRANSPARENTLY TO AN OPERATING SYSTEM" discloses a system comprising a multi-core processor with several groups of cores. The second group may be of an instruction set architecture (ISA) different from that of the first group, or of the same ISA but with a different power and performance level. The processor further includes a migration unit that processes migration requests for a number of different scenarios and causes a context switch to dynamically migrate a process from a core of the second group to a core of the first group. This dynamic hardware context switch may be transparent to the operating system. This approach has limitations, for example in terms of placement flexibility and instruction-set dependency. There is a need for methods and systems for managing computational tasks on functionally asymmetric multi-core processors, wherein the processor cores are associated with one or more hardware extensions.
[0006] SUMMARY OF THE INVENTION The present invention relates to a method for managing a computation task on a functionally asymmetric multi-core processor, wherein at least one core of said processor is associated with one or more hardware extensions, the method comprising the steps of receiving a computation task associated with instructions executable by a hardware extension; receiving calibration data associated with the hardware extension; and determining an opportunity cost of performing the computation task based on the calibration data. Developments describe the determination of the calibration data, in particular by counting or by calculation (online and/or offline) of the classes of instructions executed, the execution of a predefined set of instructions representative of the execution space of the extension, the consideration of energy and temperature aspects, the translation or emulation of instructions, and the placement of computing tasks on the different cores. System and software aspects are described. According to one embodiment, the invention is implemented at the hardware level and at the operating-system level. The invention retrieves and analyzes certain information from the last execution quanta (the time allocated by the scheduler to the task on a core) and thus estimates the cost of future task placements on the different types of cores. Advantageously, according to the invention, task placement is flexible and transparent for the user.
[0007] Advantageously, the method according to the invention performs the placement of computation tasks according to more diversified and global objectives than the known solutions of the state of the art, which confine themselves to the instructions associated with the computation tasks. Indeed, according to the known solutions, the placement of computation tasks involving the switching on or off of one or more cores is done considering only the "strict" use of the extension: a core that does not include an extension cannot execute the instructions of that extension. As a result, current solutions merely examine whether an extension is used (placement of the application on a core with the extension) or not used (placement on a core without the extension). In other words, the known approaches consider the presence of calls to such extensions within the source code but place the tasks exclusively according to the code instructions, ignoring any other criterion, including energy. Other known approaches are based on an analysis of the source code and the planned use of hardware extensions and carry out code mutations, but without estimating criteria such as the degradation of performance or energy. Advantageously, the invention can meet the so-called "multi-objective" needs of a task scheduler, such as a) performance, b) energy efficiency and c) thermal constraints ("dark silicon"). Some embodiments make it possible to increase the number of cores of a multi-core processor while maintaining a high performance/energy/area efficiency. The additional programming effort remains small. Advantageously, according to the invention, the prediction of the costs and the performance gains of a given task on each of the different cores of a heterogeneous system can be carried out dynamically. Advantageously, the method according to the invention takes into account dynamic criteria related to the execution of a program.
Currently, the exploitation of a hardware extension by software code is dictated either explicitly by the code programmer or automatically by the compiler. Since the compilation of software is most often done offline, the programmer or the compiler can only base the choice of whether or not to use an extension on static criteria related to the software itself, and not on dynamic criteria of program execution. Without information about the execution environment (workload, instantaneous task scheduling, availability of resources), it is generally impossible to determine in advance whether or not to exploit a given hardware extension. The method according to the invention makes it possible to predict and/or estimate the use of one or more hardware extensions. Advantageously, according to the invention, it becomes possible to implement a dynamic scheduling and task-placement strategy that lifts various constraints, such as (i) predictability and interoperability in the use of constrained heterogeneous systems (functional asymmetry with a common base) and (ii) optimization with regard to the overall objectives of the system (e.g. performance, consumption and surface area), enabled by a fast, dynamic and transparent prediction of the execution of a given code on a given core.
[0008] Advantageously, some embodiments of the invention include prediction steps as to the use of the extensions, which can advantageously optimize energy efficiency and computational performance. These embodiments may in particular leave some freedom to the scheduler as to the choice of cores on which to execute the different program phases. Advantageously, experimental results have shown that, once no longer constrained by the instruction set, the dependence on quantum size is very small: the percentages of time spent on a basic core are very close regardless of the quantum size, which makes it possible to be independent of the other parameters of the scheduler. Advantageously, the method according to the invention makes it possible to optimize the energy consumption, including in the case of weak use of an extension. Advantageously, the method according to the invention makes it possible to place a task on a processor core whether or not this core is associated with a hardware extension, depending on the implementation of the task (thus with or without exploitation of the extension). As a result, a scheduler can dynamically optimize energy or performance without worrying about the initial implementation of the task.
[0009] Advantageously, the invention makes it possible to predict or dynamically determine the advantage of using one or more hardware extensions and to place the calculation tasks (i.e. to allocate these tasks to the different processors and/or processor cores) according to this prediction.
[0010] Advantageously, the method according to the invention makes it possible to delay until runtime the decision whether or not to use an extension, gives the programmer a higher level of abstraction and provides the system software with increased decision-making freedom, hence more optimization in terms of scheduling. Once this flexibility is achieved, the quality of the system software's decisions becomes solely dependent on the execution environment. To do this, the system software measures the relevant variables of the runtime environment. Advantageously, in a manner that is transparent to the programmer and dynamic for the compiler, the method according to the invention provides task-migration freedom and accumulated knowledge concerning the execution environment in order to optimize the placement of computing tasks on one or more functionally asymmetric multi-core processors, depending on the overall objectives of the scheduler. Advantageously, the method according to the invention confers freedom of action with regard to the placement of calculation tasks. Such freedom allows the system to migrate (computing) tasks to any processor core, despite the different extensions present in each of these cores. The software applications running on the system are associated with a more continuous and flexible exploration space to achieve the objectives (or multi-objectives) in terms of power (e.g. thermal performance).
[0011] Advantageously, embodiments of the invention may be implemented in "System-on-Chip" (SoC) devices for consumer or embedded electronic applications (e.g. phones, embedded components, desktop, Internet of Things, etc.). In these application domains, the use of heterogeneous systems is common in order to optimize computational efficiency for a specific workload. Since these systems are becoming more complex (e.g. increasing numbers of cores and workloads), certain embodiments of the invention make it possible to reduce the impact of this complexification, while abstracting away the heterogeneity of the systems. Advantageously, experimental results (MiBench and SDVBS benchmarks) have demonstrated gains in performance and energy consumption compared to the known systems and methods of the state of the art. Advantageously, the method according to the invention allows a dynamic optimization of performance and energy thanks to the degradation estimator and the prediction unit.
[0012] Advantageously, the scalability of multi-core processors is improved. In particular, the asymmetry management of the platform is done transparently for the user, that is to say, it does not increase the effort or the constraints of software development.
[0013] Advantageously, an optimization of the scheduling makes it possible to better respond to technical problems that increase with the number of cores, such as the reduction of the area and the energy consumed, the priority of the tasks, and dark silicon (temperature).
[0014] Advantageously, the placement/scheduling of tasks according to the invention is flexible. In addition, the scheduling is done in an optimized and transparent manner for the user.
[0015] Advantageously, in comparison with known solutions, the method according to the invention provides prediction units ("predictors") and/or interest estimators, allowing optimized task placement. Known approaches that monitor program execution and attempt to optimize the placement of computational tasks are generally interested only in the internal differences between variants of the same processor (e.g. "in-order/out-of-order" execution, frequency differences, etc.) and cannot be applied to the hardware extensions themselves (the data, models and constraints are different).
[0016] Advantageously, according to the invention, the dependency on the instruction set is reduced or even eliminated. Benefits include improved energy efficiency and performance. The use of an extension does not always bring gains in performance and energy. For example, a calculation phase with low use of an extension makes it possible to accelerate the computations (in comparison with an execution carried out on a core without the extension), but the energy consumption can be increased because of the static energy of the extension (not compensated by the gain in execution time). Similarly, in terms of performance, the compiler can use an extension expecting that it will speed up the execution but, because of the additional memory transfers (e.g. between the extension and the "basic" core registers), the performance will actually be degraded.
[0017] Advantageously, the choice of task placement according to the method is flexible. In current systems, the objective of the operating system is not always to optimize performance; the objective can be a minimization of the energy consumed or a placement optimizing the temperature of the cores ("dark silicon"). Due to the dependency on the instruction set, the scheduler may be forced to execute an application according to its resource requirements rather than with a global objective in mind. Advantageously, according to the method, the placement is independent of the other parameters supervised by the scheduler. A scheduler generally allocates a quantum of time to each computing task on a processor. If the calculation of a task is not completed, another quantum is allocated to it. The size of these quanta is variable, to allow a fair and optimized sharing of the different tasks between the different processors (typically between 0.1 and 100 ms). The dimensioning of the quanta determines a more or less fine detection of the basic-type phases. There may be edge effects (e.g. ceaseless migrations). Being bound by these edge effects ultimately reduces placement flexibility and optimizations.
[0018] DESCRIPTION OF THE FIGURES Various aspects and advantages of the invention will appear in support of the description of a preferred, but non-limiting, embodiment of the invention, with reference to the figures below: FIG. 1 illustrates examples of processor architectures; FIGS. 2A and 2B illustrate some aspects of the invention in terms of energy efficiency, depending on whether hardware extensions are used or not; FIG. 3 illustrates examples of architectures and task placement; FIG. 4 provides examples of steps of the method according to the invention.
[0019] The invention generally allows an optimized placement of computing tasks on a functionally asymmetric multi-core processor. A functionally asymmetric multi-core processor comprises programmable elements or processor cores offering more or less extensive functionalities. A "full" core (i.e. with one or more hardware extensions) is a "basic" core (i.e. without a hardware extension) augmented by one or more hardware extensions. A "hardware extension" is a circuit such as a floating-point unit (FPU), a vector computing unit, a SIMD unit, a cryptographic processing unit, a signal-processing unit, etc. A hardware extension introduces a dedicated hardware circuit accessible by or connected to a processor core, which circuit provides high performance for specific computing tasks. These specific circuits improve the performance and energy efficiency of a core for particular calculations, but their intensive use can lead to reduced performance in terms of Watts per unit area. The hardware extensions associated with a processor core are provided with an instruction set that extends the standard or default set (ISA). Hardware extensions are usually well integrated into the core pipeline, which allows efficient access to their functions through instructions added to the "basic" set (by comparison, a specialized "coprocessor" usually requires instructions tied to specific protocols).
[0020] A calculation task (or "thread") comprises instructions, which may be grouped into instruction sequences (temporally) or into sets or classes of instructions (by nature). The term "computational task" (or "task") refers to a "thread" or "thread of execution". Other denominations include phrases such as "light process", "processing unit", "execution unit", "instruction thread" or "exetron". The expression designates the execution of a set of instructions in the machine language of a processor. From the user's point of view, these executions seem to run in parallel. However, whereas each process has its own virtual memory, the threads of the same process share its virtual memory; each thread nevertheless has its own call stack. A computational task does not necessarily use a hardware extension: in the most common case, a computational task is performed using instructions common to all processor cores. A calculation task can (optionally) be executed by a hardware extension (which nevertheless requires the associated core to "decode" the instructions).
[0021] The function of a hardware extension is to speed up the processing of a specific instruction set: a hardware extension does not necessarily speed up the processing of another type of instruction (e.g. floating point versus integer).
[0022] A processor core may be associated with zero, one or more hardware extensions. These hardware extensions are then "exclusive" (a given hardware extension cannot be accessed by another core). In some architectures, the processor core includes the hardware extension(s). In other architectures, the physical circuits of a hardware extension may include a processor core. A processor core is a set of physical circuits capable of executing programs "autonomously". A hardware extension is able to execute a part of a program but is not "autonomous" (it requires association with at least one processor core).
[0023] A processor core may have all the hardware extensions required to execute the instructions contained in a given task. A processor core may also not have all the hardware extensions required to execute the instructions included in the computational task: the technical problem of migration and/or emulation (of the functionality or of the instruction), and the associated costs, then arises. Hardware extensions may be of different natures. An extension can process a set of instructions specific to it.
[0024] A hardware extension is generally expensive in circuit area and static energy. An asymmetric platform, compared with a symmetric processor, reduces surface and energy cost by reducing the number of extensions and by turning the cores containing extensions on or off only when this is necessary or advantageous. Regarding the nature of the association between the processor cores and the hardware extensions: in one embodiment, a processor core exclusively accesses one or more hardware extensions dedicated to it. In another embodiment, an extension may be "shared" between multiple processor cores. In these different configurations, the computational load should be distributed as well as possible. For example, it is advantageous not to overload a shared extension.
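The core/extension association described above can be sketched as a simple data model. This is a hypothetical illustration: the `Core` class and `can_run_natively` helper are invented names, not part of the patent.

```python
# Hypothetical sketch of a functionally asymmetric multi-core processor:
# every core executes the common instruction set; some cores additionally
# carry one or more exclusive hardware extensions (FPU, SIMD, ...).

from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    core_id: int
    extensions: frozenset = frozenset()  # exclusive hardware extensions

def can_run_natively(core, required_extensions):
    # A task runs without emulation only if the core carries every
    # extension whose instructions the task contains.
    return required_extensions <= core.extensions

# A 4-core asymmetric processor: two basic cores, one FPU core, one FPU+SIMD core.
cores = [
    Core(0),
    Core(1),
    Core(2, frozenset({"FPU"})),
    Core(3, frozenset({"FPU", "SIMD"})),
]

task_needs = frozenset({"FPU"})
print([c.core_id for c in cores if can_run_natively(c, task_needs)])  # [2, 3]
```

Cores 0 and 1 would require instruction adaptation (translation or emulation) for this task, which is precisely the opportunity-cost trade-off the method evaluates.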
[0025] The operating system, through the scheduler, provides quanta of time, each quantum being allocated to a given software application. According to one embodiment of the invention, the scheduler is of the preemptive type (i.e. the time quanta are imposed on the software applications). A (computer-implemented) method for managing a computing task on a functionally asymmetric multi-core processor is disclosed, wherein at least one core of said processor is associated with one or more hardware extensions, the method comprising the steps of receiving a calculation task, said computing task being associated with instructions executable by a hardware extension associated with the multi-core processor; receiving calibration data associated with said hardware extension; and determining an opportunity cost of performing the computation task based on the calibration data. One or more opportunity costs of execution can be determined. At least one of the plurality of cores is thus characterized by an opportunity cost of executing the received computational task. The method according to the invention considers various "opportunities" of placement, i.e. possibilities or potentialities, which are analyzed. In a development, the computation task comprises instructions associated with one or more predefined classes of instructions and the hardware extension is associated with one or more predefined classes of instructions, said classes being executable by said extension.
[0026] In one development, the calibration data comprise coefficients indicative of a unit execution cost per instruction class, said coefficients being determined by comparing the execution of a predefined set of instructions representative of the execution space of said extension on said hardware extension on the one hand, and the execution of said predefined set of instructions on a processor core without the hardware extension on the other hand. The "predefined set of instructions representative of the execution space" (ER) is intended to represent the different possibilities of executing the instructions in the most complete, i.e. the most exhaustive, manner possible. With regard to the nature of the instructions (i.e. the classes or types of instructions), completeness can be achieved. On the other hand, the combinatorics of the sequences of the different instructions being virtually infinite, the representation is necessarily imperfect; it can nevertheless be approached asymptotically. The principle of executing such a set of instructions makes it possible to determine "control points" of the method according to the invention (e.g. effective calibration data).
[0027] Experimentally, a hundred or so "programs" were run, each containing "tests" (unitary executions), together with the execution of software applications in real conditions. The unit tests aimed to execute a small number of instruction types, mainly floating-point, control and/or memory instructions. Different benchmarks were used (e.g. MiBench, SDVBS, WCET, fbench, Polybench). The execution of real software applications aimed to represent the main sequences of instruction execution. In particular, the software applications used different data sets.
[0028] In one embodiment, each software application may be compiled for a processor core with the hardware extension and also compiled for a core without the extension. Both versions of the binaries are then executed on the different cores. The difference in execution time is determined, as well as the number and nature of the classes of instructions executed. For a fairly large set of applications, it is possible to determine a scatter plot representing the total execution space. As a result, it is possible to correlate the number of instructions associated with each instruction class with the difference in execution time between cores with and without the extension. To improve the representativeness of the representative space (ER), the number of software applications can be increased. Through the use of statistical techniques, this representativeness can be quantified (use of random instructions, thresholds, confidence intervals, cloud distribution, etc.). As a result, a minimum (or optimal) number of software applications to be executed under real conditions can be determined. By comparing the execution of instructions on a core with the extension and on a core without it, some results have highlighted deviations of the order of 10 to 20% between the performance estimated according to the method and the performance actually measured. This order of magnitude demonstrates the feasibility of an operational implementation of the invention (without any additional optimization).
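The correlation step described above can be sketched as a small least-squares calibration: per-class instruction counts are regressed against the measured execution-time difference between a core with the extension and a core without it, yielding the per-class unit-cost coefficients. The function name, the two-class restriction and all numbers are illustrative assumptions, not from the patent.

```python
# Illustrative offline calibration: solve the 2x2 normal equations
# (A^T A) x = A^T b by hand for two instruction classes, giving a unit
# execution-cost coefficient per class. Real calibrations would use more
# classes and a library solver; the data below are synthetic.

def calibrate(samples):
    """samples: list of ((n_float, n_mem), delta_t) per benchmark run,
    where delta_t is the execution-time difference between cores."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for (x1, x2), dt in samples:
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * dt
        b2 += x2 * dt
    det = s11 * s22 - s12 * s12
    c_float = (b1 * s22 - b2 * s12) / det
    c_mem = (s11 * b2 - s12 * b1) / det
    return c_float, c_mem

# Synthetic runs generated with true unit costs (3.0, 0.5):
runs = [((100, 200), 400.0), ((50, 400), 350.0), ((300, 100), 950.0)]
print(calibrate(runs))  # recovers (3.0, 0.5)
```

In practice the scatter plot mentioned above corresponds to many such runs; the regression quality (residuals, confidence intervals) indicates whether the representative space (ER) is large enough.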
[0029] In a development, the method further comprises a step of determining a number of uses of each class of instructions associated with the computing task by said hardware extension.
[0030] In a development, the step of determining the number of uses of each instruction class includes a step of counting the number of uses of each class of instructions. In one embodiment, the history (i.e. the past) is used to estimate or evaluate the future.
[0031] In a development, the step of determining the number of uses of each instruction class includes a step of estimating the number of uses of each instruction class, in particular from the uses counted in the past. It is also possible to estimate the number of uses by other methods, and to combine counting and estimating the number of uses of instruction classes. In a development, the opportunity cost of execution is determined by a summation, indexed by instruction class, of the coefficients per instruction class multiplied by the number of uses per instruction class. In this specification, the opportunity cost of execution is also sometimes referred to as "degradation" (from the perspective of a loss of performance). Symmetrically, there can also be "execution gains" ("accelerations" or "benefits" or "improvements"). The opportunity cost of execution can be determined from the number of uses actually counted and/or estimated based on the counting history.
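The class-indexed summation described above can be sketched in a few lines. The coefficient values, class names and counter source are hypothetical; only the summation structure comes from the text.

```python
# Sketch of the degradation estimate: the opportunity cost of running a
# task on a basic core is the sum, over instruction classes, of the
# calibrated unit cost times the (counted or estimated) usage count.

def opportunity_cost(coefficients, usage_counts):
    """coefficients: unit execution cost per instruction class;
    usage_counts: uses per class over the last execution quanta."""
    return sum(coefficients[cls] * n for cls, n in usage_counts.items())

coeffs = {"float": 3.0, "vector": 5.0, "memory": 0.5}   # from calibration
counts = {"float": 120, "vector": 0, "memory": 40}      # from counters
print(opportunity_cost(coeffs, counts))  # 3.0*120 + 5.0*0 + 0.5*40 = 380.0
```

A task that never touches the extension (all counts zero) has a cost of 0, i.e. it can be migrated to a basic core for free.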
[0032] Advantageously, the upstream determination of the "opportunity cost of execution" (specifically defined by the invention) allows efficient optimizations to be conducted downstream (as described below). The "opportunity cost of execution" according to the invention therefore corresponds to a "reading grid" specific to the processor and the management of the calculation tasks, that is to say to the definition of an "intermediate result" (decision support) which subsequently allows specific management steps, dependent on this characterization, to be conducted. This perspective is particularly relevant insofar as it makes it possible to control the processor efficiently (intermediate aggregates improve the controllability of the system). As a reminder, taking into account the classes of instructions and the number of uses of these classes specifically allows the determination of said "opportunity cost of execution". In a development, the coefficients are determined offline. It is possible to determine the calibration data once and for all ("offline"). For example, the calibration data can be provided by the processor manufacturer. The calibration data can be present in a configuration file.
[0033] In a development, the coefficients are determined online. The coefficients can be determined during the execution of a program, for example at the "reset" of the platform. In an open system (for example a cluster or a "cloud") whose topology is unknown a priori, it is possible to calibrate each extension at startup and to determine the topology globally. In a development, the coefficients are determined by multivariate statistical analysis. Different multivariate statistical analysis techniques can be used, possibly in combination. (Linear) regression is advantageously fast. Principal component analysis (PCA) advantageously makes it possible to reduce the number of coefficients. Other usable techniques include correspondence analysis (CA), so-called factorial analysis, data partitioning (clustering), multidimensional scaling (MDS), analysis of similarities between variables, multiple regression analysis, analysis of variance (ANOVA, bivariate) and its multivariate generalization (multivariate analysis of variance), discriminant analysis, canonical correlation analysis, logistic regression (LOGIT model), artificial neural networks, decision trees, structural equation models, conjoint analysis, etc. In one development, the received computation task is associated with a predetermined processor core and the opportunity cost of executing the computation task is associated with a processor core other than the predetermined processor core. It is generally considered (but not required) what the cost of continuing execution on the predetermined processor core would be, if any. In other cases, the other "candidate" cores are taken into consideration (one, several or all of the addressable cores).
[0034] In a development, the opportunity cost of performing the computation task is determined for at least one processor core other than the predetermined processor core. In one development, all processor cores are considered in turn and the placement optimization consists in minimizing the opportunity cost of execution (i.e., for example, in selecting the processor core associated with the lowest opportunity cost of execution). In some embodiments, such a "minimum" function can be used. In other embodiments, specific algorithms (sequences of steps) and/or analytic functions and/or heuristics may be used (for example, the candidate cores may be compared in pairs and/or sampled in various ways, for example to speed up the placement decision).
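The "minimum" placement rule mentioned above can be sketched as follows. The function name and the cost values are illustrative; the patent also allows pairwise comparison or heuristics in place of a plain minimum.

```python
# Minimal placement sketch: among the candidate cores, select the one
# with the lowest estimated opportunity cost of execution.

def place_task(costs_per_core):
    """costs_per_core: {core_id: estimated opportunity cost of executing
    the received task on that core}. Returns the selected core id."""
    return min(costs_per_core, key=costs_per_core.get)

# Hypothetical estimates: cores 2 and 3 carry the FPU the task uses,
# so their degradation is far lower than on the basic cores 0 and 1.
costs = {0: 380.0, 1: 410.0, 2: 35.0, 3: 60.0}
print(place_task(costs))  # 2
```

Because the estimate exists for every addressable core, the scheduler is no longer forced onto extension-equipped cores by the instruction set alone.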
[0035] In addition (entirely optionally) to the consideration of the opportunity cost of execution, other criteria can be taken into account to optimize the placement of the calculation task. These criteria may be taken into account in a complementary manner (but in some cases may be substituted for the opportunity-cost criterion). These generally complementary criteria may include, in particular, parameters relating to the execution time of the calculation task and/or the energy cost associated with its execution and/or the temperature (i.e. the local consequence of the execution considered). The various costs (execution opportunity, temperature, energy, performance, etc.) can be compared with each other, and various arbitration logics can select a particular core in consideration of these different criteria. Regarding combinatorial optimization or multi-objective optimization (some objectives may be antagonistic), different mathematical techniques are applicable. The weighting of these different criteria can in particular be variable and/or configurable. The respective weights allocated to the different placement-optimization criteria can for example be configured online or offline; they can be "static" or "dynamic". For example, the priority and/or weight of these different criteria may vary over time. Analytic functions or algorithms can regulate the various allocations, arbitrations, compromises or priorities between the criteria for optimizing the placement of the calculation tasks. In one development, the energy cost determination comprises one or more of the steps of receiving initial indications of use of one or more predefined hardware extensions, and/or receiving energy-consumption states (e.g. DVFS) per processor core, and/or receiving performance-asymmetry information, and a step of determining a power-gating and/or clock-gating energy optimization.
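One possible arbitration logic between the (possibly antagonistic) criteria listed above is a configurable weighted sum per candidate core. The weights, metric names and values below are hypothetical; the patent deliberately leaves the arbitration function open.

```python
# Sketch of a multi-objective placement score: a weighted sum of the
# opportunity cost, energy cost and temperature estimated per core.
# Lower is better for every criterion in this sketch.

def multiobjective_score(metrics, weights):
    return sum(weights[k] * metrics[k] for k in weights)

weights = {"opportunity": 1.0, "energy": 0.5, "temperature": 2.0}  # configurable
candidates = {
    2: {"opportunity": 35.0, "energy": 20.0, "temperature": 20.0},
    3: {"opportunity": 60.0, "energy": 10.0, "temperature": 2.0},
}
best = min(candidates, key=lambda c: multiobjective_score(candidates[c], weights))
# Core 2 scores 35 + 10 + 40 = 85; core 3 scores 60 + 5 + 4 = 69.
print(best)  # 3: the thermal weight overturns the pure opportunity-cost choice
```

Making the weights dynamic (e.g. raising the temperature weight as a core heats up) yields the time-varying priorities described above.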
[0036] In one development, the method further comprises a step of determining an adaptation cost of the instructions associated with the computing task, said step comprising one or more of the steps of translating one or more instructions and/or selecting one or more instruction versions and/or emulating one or more instructions and/or executing one or more instructions in a virtual machine. The adaptation of the instructions becomes necessary if the preceding steps determine a processor core that does not have the required hardware extension (a "non-equipped" core). In a development, the method further comprises a step of receiving a scheduling and/or placement parameter and/or logic rule. This development highlights the different possibilities for "controllability" of the process according to the invention. Logical rules (Boolean expressions, fuzzy logic, business rules, etc.) can be received from third-party modules, as can factual threshold values such as maximum temperatures, time ranges, etc.
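As a purely illustrative sketch of the emulation option mentioned above, an extended instruction (here a hypothetical vector add, not an instruction named by the patent) can be re-expressed on a "basic" core as a scalar loop, while a software counter records how many extended instructions an equipped core would have executed:

```python
# Hypothetical sketch: emulate an extended (vector) instruction on a
# "basic" core while counting uses per instruction class, so that the
# scheduler can later estimate what an equipped core would have executed.

counters = {"vadd": 0}  # software counter for the hypothetical "vadd" class

def emulated_vadd(a, b):
    counters["vadd"] += 1                  # count the emulated extended instruction
    return [x + y for x, y in zip(a, b)]   # scalar fallback of the vector add

result = emulated_vadd([1, 2, 3], [4, 5, 6])
```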
[0037] In one development, the method further comprises a step of moving the calculation task from the predetermined processor core to the determined processor core. In one development, the method further includes a step of disabling or shutting down one or more processor cores. A core may be "disabled", have its "clock signal cut", be "put into a reduced state of consumption" (e.g. by decreasing the clock frequency) or be "turned off" ("dark silicon"). In a development, the functionally asymmetric multi-core processor is a physical processor or a virtual processor. The processor may be a tangible or physical processor. The processor may also be a virtual processor, i.e. logically defined. For example, its perimeter can be defined by the operating system. A processor can also be determined by a hypervisor.
[0038] A computer program product is disclosed, said computer program comprising code instructions for performing one or more process steps.
[0039] There is disclosed a system comprising means for carrying out one or more process steps. In one development, the system comprises a functionally asymmetric multi-core processor, at least one core of said processor being associated with one or more hardware extensions, the system comprising receiving means for receiving a computing task, said computing task being associated with instructions executable by a hardware extension associated with the multi-core processor; receiving means for receiving calibration data; and means for determining an opportunity cost of executing the calculation task based on the calibration data. In one development, the system further comprises means selected from placement means for placing one or more computing tasks on one or more cores of the processor; means for counting the use of instruction classes by a hardware extension, said means including software and/or hardware counters; means or registers for saving the execution context of a calculation task; means for determining the migration cost and/or the adaptation cost and/or the energy cost associated with continuing the execution of a calculation task on a predefined processor core; means for receiving one or more parameters and/or scheduling rules; means for determining and/or selecting a processor core; means for executing, on a processor core without the associated hardware extension, a computational task initially intended to execute on a processor comprising one or more hardware extensions; means for moving a compute task from one processor core to another processor core; and means for disabling or shutting down one or more processor cores.
[0040] Figure 1 illustrates examples of processor architectures. A functionally asymmetric multi-core processor FAMP 120 is compared to a symmetric multi-core processor SMP 110, where each core contains all hardware extensions, and to a symmetric multi-core processor SMP 130 containing only basic cores. The architecture may be shared-memory or distributed. Cores can sometimes be interconnected by a NoC (Network-on-Chip).
[0041] A FAMP 120 architecture is close to that of an SMP 130 symmetric multi-core processor. The memory architecture remains homogeneous, but the features of the cores can be heterogeneous. As illustrated in the example of Figure 1, a FAMP architecture 120 may comprise four different types of cores (with the same base). The size of the processor can be reduced. Because of the ability to turn the cores on or off, and therefore to choose which ones to activate according to the needs of the software application, different optimizations can be made (gain in energy, reduction of the temperature of the cores, etc.).
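For illustration only, such a functionally asymmetric platform can be modeled as cores sharing a common base instruction set, each core optionally carrying extension instruction sets; the core names and extension names below are hypothetical:

```python
# Hypothetical sketch of a FAMP: every core executes the common base ISA;
# extensions (e.g. FPU, SIMD) are unevenly distributed across the cores.

cores = {
    "core0": {"fpu", "simd"},  # fully extended core
    "core1": {"fpu"},          # partially extended core
    "core2": set(),            # basic core (base ISA only)
}

def native_cores(task_extensions):
    """Cores able to run the task without instruction adaptation."""
    return [c for c, ext in cores.items() if task_extensions <= ext]

candidates = native_cores({"fpu"})  # a task using FP instructions fits core0 and core1
```

A task using no extension can be placed on any core, including the basic one, which is what enables the energy optimizations described above.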
[0042] Figures 2A and 2B illustrate some aspects of the invention in terms of energy efficiency, depending on whether hardware extensions are used or not.
[0043] Figure 2A refers to "Full" (or "extended", i.e. with one or more hardware extensions) and "Basic" (without hardware extension) processors. The figure illustrates the energy consumed by a processor (surfaces 121 and 122) as a function of the power consumed ("Power" on the ordinate 111) during an execution time (time on abscissa 112) for the same application or thread on both types of processors. On a "Full" processor with a power Pf, a calculation task executed during a time tf consumes an energy EFull 121. On a "Basic" processor with a power Pb, a calculation task executed during a time tb consumes an energy EBasic 122. The power Pf is greater than the power Pb because the hardware extension requires more power. It is common for the execution time tb to be greater than the execution time tf because, having no extension, a basic processor executes the calculation task less quickly.
[0044] A general objective is to minimize the energy consumed, i.e. the area of the surface (the shortest possible execution time with minimal power). In the example of Figure 2A, the Full processor is more efficient. Considering Pb and Pf as constants, the tb / tf ratio indicates the "acceleration" of the computation due to the hardware extension.
[0045] FIG. 2B illustrates the variation of the energy gain (EBasic/EFull) as a function of the acceleration tb/tf, which indicates the opportunity to use an extended processor with hardware extension(s) rather than one without. If the EBasic/EFull ratio is less than 1, less energy is consumed with a Basic-type processor than with an Extended processor (this operation domain 141 is called "slightly extended", which means that the hardware extension is not used enough). If the EBasic/EFull ratio is less than 1 and the tb/tf acceleration ratio is also less than 1, this translates into a gain both in energy consumed and in computation-time performance, to the advantage of the processor without hardware extension (this can happen in cases where the extension is misused). If the EBasic/EFull ratio is greater than 1, the extended processor consumes less energy (domain 143, "highly extended"): the extension accelerates the code (i.e. the instructions) and gains are obtained both in terms of energy (less is consumed) and in terms of acceleration (faster execution). It can be noted that, since the energy is the product of power and time, when EBasic/EFull is equal to 1 the tb/tf ratio is generally Pf/Pb (when the basic and extended type processors consume as much energy, the associated acceleration generally amounts to the ratio of the powers of the processors). When tb/tf is equal to 1 (when the execution times are the same), EBasic/EFull is generally Pb/Pf (i.e. the energy ratio then amounts to the power ratio). The operation domain 144 corresponds to an acceleration of the calculations combined with a lower energy consumption, to the advantage of an extended processor with hardware extension. A hardware extension can reduce power consumption by reducing the number of steps required to perform specific calculations.
When the hardware extension utilization is sufficiently intensive, the acceleration of computation on the extension may compensate for the additional power required by the extension itself.
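Using only the relations given above (E = P × t, hence EBasic/EFull = (Pb·tb)/(Pf·tf)), the operating domains of Figure 2B can be illustrated numerically; the power and time values below are arbitrary examples:

```python
# Illustrative sketch of the EBasic/EFull energy-gain ratio of Fig. 2B.
# E = P * t ; the values below are arbitrary, not measured data.

def energy_gain(p_basic, t_basic, p_full, t_full):
    """EBasic/EFull: > 1 means the extended (Full) core saves energy."""
    return (p_basic * t_basic) / (p_full * t_full)

# Intensive use of the extension: large acceleration tb/tf -> "highly extended"
g1 = energy_gain(p_basic=1.0, t_basic=4.0, p_full=2.0, t_full=1.0)  # 2.0

# Marginal use of the extension: small acceleration -> "slightly extended"
g2 = energy_gain(p_basic=1.0, t_basic=1.1, p_full=2.0, t_full=1.0)  # 0.55
```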
[0046] In comparison with coprocessors, a hardware extension may be more efficient because of the strong coupling between extension and core. Hardware extensions frequently share a significant portion of the "basic" core and take advantage of direct access to the L1 and L2 caches. This coupling notably makes it possible to reduce data transfers and synchronization complexity. Even though compilation tools make it possible to maximize the use of a hardware extension, hardware extension uses may remain anecdotal compared to the overall CPU load. A hardware extension usually involves an additional cost in terms of area and static power consumption. An extension often includes registers of significant size, which may require a larger bandwidth as well as larger data transfers than those required by the core circuit. If the hardware extension usage is too low, its own energy consumption is likely to exceed the expected energy savings due to hardware acceleration. Worse, an underutilized hardware extension can lead to a reduction in the performance of a software application. For example, in the case of intensive data transfers between the hardware extension and the core registers, the overhead of the memory transfer may cancel the benefit of the acceleration. A hardware extension is generally not "transparent" to the application developer. In contrast to super-scalar features, the developer is frequently forced to explicitly handle extended instruction sets. As a result, the use of hardware extensions can be expensive compared to the core circuit alone. For example, the ARM NEON/VFP extension accounts for about 30% of the area of a Cortex A9 processor (ignoring cache, and 10% counting the L1 cache).
Nevertheless, hardware extensions are critical and remain necessary to address certain technical problems, notably in terms of the power and performance required by certain application classes (for example multimedia, cryptography, image processing and signal processing applications). Within a symmetric multiprocessor architecture (SMP), in order to maintain a uniform instruction set across the different processor cores, extensions must be implemented in each processor core. This ISA symmetry involves the use of a significant surface area and a certain power consumption, which in turn limits the benefits of hardware acceleration via extensions.
[0047] Advantageously, the method according to the invention makes it possible (a) to asymmetrically distribute the extensions in a multi-core processor, (b) to know when the use of an extension is "useful" from a performance point of view, and (c) to move the application over time across different cores, regardless of its starting requirement (which is chosen at compile time and is therefore not necessarily adequate for the execution context). Figure 3 illustrates examples of architectures and task placement. The figure shows examples of task scheduling on an SMP, and two types of scheduling on a FAMP. The FPU (Floating Point Unit) adds FP-type instructions to a core's ISA. In the first case 310 (SMP), each quantum of the task is executed on a processor having an FPU. It turns out that the first and the last quanta do not contain FP instructions. In the second case 320 ("ISA-constrained"), thanks to the asymmetry of the platform, the first and the last quanta can run on a basic core (not having an FPU), which reduces the energy consumption during the execution. The third quantum contains FP instructions, but not enough to make the FP unit "worthwhile". In the third case 330, the method according to the invention allows the third quantum to be executed on a "basic" type processor, which makes it possible to optimize the energy consumed to perform this calculation task. Figure 4 provides examples of steps of the method according to the invention.
[0048] The steps take place within a multi-core processor, for which a scheduler places the calculation tasks during the execution of a program. In step 400, which takes place online and in a loop, the data relating to the use of the hardware extensions is collected (for example by the scheduler). The use of the extensions is monitored. In one embodiment, the method for recovering data during execution focuses only on the instructions of the hardware extensions. According to this embodiment, the method is independent of the scheduling process. When code is run on a processor with one or more extensions, monitoring can rely on hardware counters. When the code is executed on a "basic" type processor, the extended instructions are no longer executed natively. It is then necessary to add routines (for example at the time of adaptation of the code, or in the emulation function if applicable) to count the number of instructions of each class of the extension that a processor with the extension would have executed. In substitution or in addition, the method according to the invention may use software counters and/or hardware counters associated with said software counters. A monitoring system can be very costly at runtime. In the context of the invention, the data collected is linked only to the instructions involving extensions, so the slowdown is negligible since collection can be done in parallel with the execution. In the context of a "basic" type processor, the filling of the counters can be done sequentially during the emulation of the accelerated functionality by the other cores, which can add an additional cost in terms of performance. To reduce this overhead, it is possible to collect the data periodically and, if necessary, to interpolate. In step 410, an event of the scheduler (for example the end of a quantum) is determined, which triggers step 420.
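As a purely illustrative sketch of the periodic collection with interpolation mentioned above (the sampling instants and counter values are arbitrary):

```python
# Hypothetical sketch: sample the extension-instruction counter periodically
# and linearly interpolate the count at an arbitrary instant between samples.

samples = [(0, 0), (10, 400), (20, 1000)]  # (time, cumulative extended-instruction count)

def interpolate(samples, t):
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return c0 + (c1 - c0) * (t - t0) / (t1 - t0)
    raise ValueError("instant outside sampled range")

count_at_15 = interpolate(samples, 15)  # between 400 and 1000 -> 700.0
```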
[0049] In step 420, the collected data is read. For example, the scheduler retrieves the data collected by the monitoring (A.0), for example by reading the hardware or software registers.
[0050] In step 430, a prediction is made. Using the recovered data, the prediction system studies the behavior of the application with respect to the use of extended instruction classes. By analyzing this behavior, it predicts the future behavior of the application. In one embodiment, the prediction is said to be simple, i.e. reactive, assuming that the behavior of the application in the next quantum will be similar to the last quantum executed. In one embodiment, the prediction is said to be "complex" and comprises steps of determining the different types of program phases. For example, these types of phases can be determined by saving a behavior history table and analyzing that history. Signatures can be used. For example, the prediction unit may associate signatures with behaviors subsequently executed. When a signature is detected, the prediction unit may indicate the future behavior, i.e. predict the future operations known to be associated with the signature. In step 440, an estimate of the "degradation" is made ("Basic" versus "Full", i.e. an estimate of the performance degradation associated with running a compute task on a processor without hardware extension compared to running it on a processor with hardware extension). A given task is not tested on each processor; rather, at scheduling time, the cost that the task would have on the different processors is estimated. In one embodiment (e.g. offline), the entire code is simulated, taking into account the complete features of the processor. This type of approach is generally not usable dynamically (online). Considering the particular case where the processor cores all have the same base, all the base instructions are executed in the same way on the different cores (for example two cores). Some edge effects may appear between the cores, for example due to caching, but these effects remain residual and may be taken into account when learning the model.
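As an illustration only, the "simple" (reactive) prediction described above, together with a minimal signature-based variant, might be sketched as follows; the instruction-class names, counts and signature table are hypothetical:

```python
# Hypothetical sketch of the prediction step: per-class extended-instruction
# counts observed over past quanta are used to predict the next quantum.

history = [
    {"fp_add": 100, "fp_mul": 40},   # quantum t-2
    {"fp_add": 120, "fp_mul": 50},   # quantum t-1 (last executed)
]

def predict_reactive(history):
    """Simple prediction: the next quantum behaves like the last one."""
    return history[-1]

def predict_by_signature(history, signature_table):
    """Complex prediction: look up a known phase signature, else fall back."""
    signature = tuple(sorted(history[-1].items()))
    return signature_table.get(signature, predict_reactive(history))

# Known phase: after this signature, FP usage is known to stop.
signatures = {(("fp_add", 120), ("fp_mul", 50)): {"fp_add": 0, "fp_mul": 0}}
nxt = predict_by_signature(history, signatures)
```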
The estimation of the performance degradation can be done in different ways. In a first approach, only the overall percentage of use of the hardware extension is taken into account. The estimation of the degradation may in fact be based on a model making it possible to calculate said degradation as a function of the percentage of use of each instruction class of the extension. This approach is not necessarily appropriate, however, since different types of instructions can have very different accelerations for the same hardware extension.
[0051] In a second approach, an estimate of the performance degradation can take into account the classes of instructions and take a form such as:

degradation = Σc (βc × nc)

where the sum is indexed by instruction class c, βc is a coefficient indicative of the unit execution cost for class c, and nc is the number of uses of class c. According to this weighted approach, the values of the weighting coefficients β ("Beta") can be calculated dynamically online, by performing a learning process consisting in executing several codes on different types of processors. These values can also be calculated offline (and stored in a table readable by the scheduler, for example). The calibration data determined from the execution of real calculation tasks make it possible in particular to take into account the cost of emulation and the complex evolutions of the beta coefficients. For example, the beta values can be distinguished for a real platform and for a simulated platform. At runtime (before or during the execution of a calculation task), the beta coefficients are used by the scheduler to estimate the performance degradation. The calibration must be repeated for each implementation of a hardware extension. The prediction step 430 and the degradation estimation step 440 are chronologically independent.
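Purely for illustration, the weighted estimate above, together with a minimal offline calibration of the β coefficients, might be sketched as follows; the class names, counts and measured degradations are invented for the example, and a real calibration would fit an overdetermined system (e.g. by least squares) rather than one run per class:

```python
# Hypothetical sketch: calibrate per-class coefficients (beta) offline from
# benchmark runs, then estimate degradation = sum over classes c of
# beta[c] * n[c], as in the weighted approach above.

# Calibration runs: (per-class instruction counts, measured degradation).
runs = [
    ({"fp_add": 10, "fp_mul": 0}, 20.0),
    ({"fp_add": 0, "fp_mul": 10}, 30.0),
]
classes = ["fp_add", "fp_mul"]

# With one run per class the fit degenerates to a simple ratio.
beta = {c: deg / counts[c]
        for counts, deg in runs for c in classes if counts[c]}

def estimate_degradation(counts, beta):
    return sum(beta[c] * n for c, n in counts.items())

d = estimate_degradation({"fp_add": 5, "fp_mul": 2}, beta)  # 5*2.0 + 2*3.0 = 16.0
```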
[0052] The degradation can be estimated (440) from the predicted data (430). Alternatively, the degradation 440 can be calculated from the data recovered in steps 400 and 420, the prediction 430 then being based on the degradation 440 thus calculated. In this case, it is the future degradation that is predicted (and not the future instructions of the extensions). The data changes but the prediction process remains the same. In other words, in step 430 the objective is to "predict" the number of instructions of each class that will be executed at t+1 as a function of the past (of t, t-1 ... t-n). In step 440, the objective is to "estimate" the execution cost at t+1 on the different types of cores. Steps 430 and 440 can be performed in any order because it is possible, independently, to estimate the cost at time t from the recovered data and to predict the cost at time t+1 from the cost at time t.
[0053] Formulated yet differently, a "prediction" approach consists in determining the future from the data available in the present. An "estimate" allows a determination of the present using data observed in the past. In another perspective, an "estimate" allows the determination of the past in a new context from observations of the past in another context. According to an implementation of the method of the invention (at runtime), a monitoring module captures the calibration data and an analysis module estimates the acceleration associated with the use of an extension. For the monitoring, the input data must be as representative as possible of the extended instructions to be executed in the next quantum (under a continuity assumption, e.g. stable use of a floating point unit). A prediction module can then be used. This assumption is realistic when programs have distinct phases of stable behavior over sufficiently long time intervals. In the case of more erratic behavior, more advanced prediction modules can be implemented.
[0054] In step 450, a decision step is performed. A logical decision unit may for example take into account the predicted degradation estimate and/or the overall objectives assigned to the platform, these objectives including, for example, objectives in terms of energy reduction, performance, priority and availability of cores. In other examples, the decision can also take into account parameters such as the migration cost and/or the cost of adaptation of the code. In some cases, the migration cost and the adaptation cost may be static (e.g. offline learning) or dynamic (e.g. online learning or runtime monitoring). In step 460, it is possible to proceed with the migration from one core to another ("core-switching"). The migration techniques implemented can be carried out with context saving (for example with backup of the registers of the extension in software registers accessible from the emulation). At step 470, the code is optionally adapted. For example, to execute code with extensions on a "basic" type processor, it is possible to use binary adaptation techniques (e.g. code translation), multi-version code, or even emulation. The present invention may be implemented from hardware and/or software elements. It may be available as a computer program product on a computer readable medium. The support can be electronic, magnetic, optical or electromagnetic.
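As a purely illustrative sketch of the decision step 450 (the cost values and core names are hypothetical), the target core can be chosen as the one minimizing the predicted execution cost for the next quantum plus the migration cost from the current core:

```python
# Hypothetical sketch of decision step 450: pick the core minimizing the
# predicted execution cost for the next quantum plus the cost of migrating
# (core-switching) away from the currently used core.

def decide(current, predicted_cost, migration_cost):
    def total(core):
        return predicted_cost[core] + (0 if core == current else migration_cost[core])
    return min(predicted_cost, key=total)

predicted_cost = {"full": 10.0, "basic": 8.0}
migration_cost = {"full": 0.5, "basic": 3.0}

# Staying on "full" costs 10.0; switching to "basic" costs 8.0 + 3.0 = 11.0.
target = decide("full", predicted_cost, migration_cost)
```

The same predicted costs lead to a different decision when the task is already on the basic core, which illustrates why the migration cost belongs in the decision.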
Claims:
Claims (23)
[0001]
CLAIMS 1. A computer-implemented method for managing a computing task on a functionally asymmetric multi-core processor, wherein at least one core of said processor is associated with one or more hardware extensions, the method comprising the steps of: - receiving a computing task, said computing task being associated with instructions executable by a hardware extension associated with the multi-core processor; - receiving calibration data associated with said hardware extension; - determining an opportunity cost of executing the computing task according to the calibration data.
[0002]
2. Method according to claim 1, the computation task comprising instructions associated with one or more predefined classes of instructions and the hardware extension being associated with one or more predefined classes of instructions, said classes being executable by said extension.
[0003]
3. Method according to the preceding claims, the calibration data comprising coefficients indicative of a unit cost of execution per instruction class, said coefficients being determined by comparison between the execution of a predefined set of instructions representative of the execution space of said extension on said hardware extension on the one hand and the execution of said predefined set of instructions on a processor core without hardware extension on the other hand.
[0004]
The method of claim 3, further comprising a step of determining a number of uses of each instruction class associated with the computational task by said hardware extension.
[0005]
5. The method of claim 4, the step of determining the number of uses of each class of instructions comprising a step of counting the number of uses of each class of instructions.
[0006]
6. The method according to claim 4 or 5, the step of determining the number of uses of each class of instructions comprising a step of estimating the number of uses of each class of instructions, in particular from the counted uses in the past.
[0007]
7. Method according to claim 4, wherein the opportunity cost of execution is determined by summation indexed by instruction class of the coefficients per class of instructions multiplied by the number of uses per class of instruction. 'instructions.
[0008]
8. The method of claim 3, the coefficients being determined offline.
[0009]
9. The method of claim 3, the coefficients being determined online.
[0010]
10. The method of claim 3, the coefficients being determined by multivariate statistical analysis.
[0011]
11. The method of any of the preceding claims, wherein the received compute task is associated with a predetermined processor core and the opportunity cost of executing the compute task is determined for at least one processor core other than the predetermined processor core.
[0012]
12. The method of claim 11, further comprising a step of determining a processor core from among the plurality of cores of the processor for performing said computational task, said step comprising the steps of determining the opportunity cost of execution for all or part of the processor cores of the multi-core processor and minimizing the opportunity cost of execution.
[0013]
13. The method of claim 11 or 12, further comprising a step of determining a processor core from among the plurality of processor cores for performing said computing task, said determination minimizing the execution time of the calculation task and/or the energy cost and/or the temperature.
[0014]
14. The method of claim 13, wherein the energy cost determination comprises one or more of the steps of receiving initial indications of use of one or more predefined hardware extensions and/or receiving energy consumption states (e.g. DVFS) per processor core and/or receiving performance asymmetry information, and a step of determining a power-gating and/or clock-gating energy optimization.
[0015]
15. The method of any of the preceding claims, further comprising a step of determining an adaptation cost of the instructions associated with the computing task, said step comprising one or more of the steps of translating one or more instructions and/or selecting one or more instruction versions and/or emulating one or more instructions and/or executing one or more instructions in a virtual machine.
[0016]
16. The method of any of the preceding claims, further comprising a step of receiving a scheduling and/or placement parameter and/or logic rule.
[0017]
17. The method of claim 16, further comprising a step of moving the calculation task from the predetermined processor core to the determined processor core.
[0018]
The method of any one of the preceding claims, further comprising a step of disabling or shutting down one or more processor cores.
[0019]
19. A method according to any one of the preceding claims, the functionally asymmetric multi-core processor being a physical processor or a virtual processor.
[0020]
20. A computer program product, said computer program comprising code instructions for performing the steps of the method of any one of claims 1 to 19, when said program is run on a computer.
[0021]
21. A system comprising means for implementing the method according to any one of claims 1 to 19.
[0022]
22. The system of claim 21, comprising a functionally asymmetric multi-core processor, at least one core of said processor being associated with one or more hardware extensions, the system comprising: receiving means for receiving a calculation task, said computing task being associated with instructions executable by a hardware extension associated with the multi-core processor; receiving means for receiving calibration data; means for determining an opportunity cost of executing the calculation task according to the calibration data.
[0023]
23. The system of claim 21 or 22, further comprising means selected from: - placement means for placing one or more calculation tasks on one or more cores of the processor; - means for counting the use of instruction classes by a hardware extension, said means comprising software and/or hardware counters; - means or registers for saving the execution context of a calculation task; - means for determining the migration cost and/or the adaptation cost and/or the energy cost associated with continuing the execution of a calculation task on a predefined processor core; - means for receiving one or more parameters and/or scheduling rules; - means for determining and/or selecting a processor core; - means for executing, on a processor core without the associated hardware extension, a computation task initially intended to execute on a processor comprising one or more hardware extensions; - means for moving a calculation task from one processor core to another processor core; - means for deactivating or shutting down one or more processor cores.
Similar technologies:
Publication number | Publication date | Patent title
EP3286647A1|2018-02-28|Placement of a calculation task on a functionally asymmetric processor
EP2707797B1|2017-11-15|Automatic load balancing for heterogeneous cores
RU2582541C2|2016-04-27|Context-sensitive slicing for dynamically parallelising binary programs
EP2521946B1|2016-08-31|Method for selecting a resource from a plurality of processing resources such that the likely time lapses before resource failure evolve in a substantially identical manner
US9436512B2|2016-09-06|Energy efficient job scheduling in heterogeneous chip multiprocessors based on dynamic program behavior using prim model
EP2257876B1|2011-08-10|Method for preloading configurations of a reconfigurable heterogeneous system for information processing into a memory hierarchy
Xu et al.2018|Cost-effective cloud server provisioning for predictable performance of big data analytics
FR2881239A1|2006-07-28|METHOD FOR MANAGING ACCESS TO SHARED RESOURCES IN A MULTI-PROCESSOR ENVIRONMENT
US10908884B2|2021-02-02|Methods and apparatus for runtime multi-scheduling of software executing on a heterogeneous system
Tarsa et al.2019|Post-silicon cpu adaptation made practical using machine learning
US20200150941A1|2020-05-14|Heterogenous computer system optimization
Nemirovsky et al.2018|A general guide to applying machine learning to computer architecture
Ahmed et al.2021|RALB‐HC: A resource‐aware load balancer for heterogeneous cluster
Fang et al.2011|An auto-tuning solution to data streams clustering in opencl
FR3056786B1|2019-11-22|METHOD FOR MANAGING CALCULATING TASKS ON A FUNCTIONALLY ASYMMETRIC MULTI-CORE PROCESSOR
Brunetta et al.2019|Selecting efficient cloud resources for hpc workloads
Wang et al.2016|Comparison and improvement of Hadoop MapReduce performance prediction models in the private cloud
Wu et al.2014|Modeling the virtual machine launching overhead under fermicloud
Wu et al.2020|Irina: Accelerating DNN Inference with Efficient Online Scheduling
Fonseca et al.2018|Understanding the impact of task granularity in the energy consumption of parallel programs
Mytilinis et al.2018|The vision of a heterogenerous scheduler
Lee et al.2017|Energy-aware paired sampling-based decision model for dynamic mobile-to-mobile service offloading
Fowers2014|A framework for dynamic, transparent, and adaptive optimizations in heterogeneous systems
EP3502895A1|2019-06-26|Control of the power consumption of a server cluster
Girdhar2016|Predicting Thread Criticality and Frequency Scalability for Active Load Balancing in Shared Memory Multithreaded Applications
Family patents:
Publication number | Publication date
WO2016173766A1|2016-11-03|
US20180095751A1|2018-04-05|
FR3035243B1|2018-06-29|
EP3286647A1|2018-02-28|
Cited references:
Publication number | Application date | Publication date | Applicant | Patent title
US20150067700A1|2012-04-12|2015-03-05|Sansung Electronics Co., Ltd.|Method and apparatus for performing task scheduling in terminal|
JP4784827B2|2006-06-06|2011-10-05|学校法人早稲田大学|Global compiler for heterogeneous multiprocessors|
US8782645B2|2011-05-11|2014-07-15|Advanced Micro Devices, Inc.|Automatic load balancing for heterogeneous cores|
WO2013101139A1|2011-12-30|2013-07-04|Intel Corporation|Providing an asymmetric multicore processor system transparently to an operating system|
US20150355700A1|2014-06-10|2015-12-10|Qualcomm Incorporated|Systems and methods of managing processor device power consumption|KR102285481B1|2015-04-09|2021-08-02|에스케이하이닉스 주식회사|method of mapping task of network-on-chip semiconductor device|
US10355975B2|2016-10-19|2019-07-16|Rex Computing, Inc.|Latency guaranteed network on chip|
US10700968B2|2016-10-19|2020-06-30|Rex Computing, Inc.|Optimized function assignment in a multi-core processor|
JP2018092311A|2016-12-01|2018-06-14|キヤノン株式会社|Information processor, control method thereof and program|
JP6860694B2|2017-04-17|2021-04-21|セレブラス システムズ インク.|Accelerated deep learning task activation|
US10795836B2|2017-04-17|2020-10-06|Microsoft Technology Licensing, Llc|Data processing performance enhancement for neural networks using a virtualized data iterator|
CA3099965A1|2017-04-17|2018-10-25|Cerebras Systems Inc.|Neuron smearing for accelerated deep learning|
US11188617B2|2019-01-10|2021-11-30|Nokia Technologies Oy|Method and network node for internet-of-thingsfeature selection for storage and computation|
US20210334101A1|2020-04-24|2021-10-28|Stephen T. Palermo|Frequency scaling for per-core accelerator assignments|
Legal status:
2016-04-28| PLFP| Fee payment|Year of fee payment: 2 |
2016-10-21| PLSC| Publication of the preliminary search report|Effective date: 20161021 |
2017-04-28| PLFP| Fee payment|Year of fee payment: 3 |
2018-04-26| PLFP| Fee payment|Year of fee payment: 4 |
2019-04-29| PLFP| Fee payment|Year of fee payment: 5 |
2020-04-30| PLFP| Fee payment|Year of fee payment: 6 |
2022-01-07| ST| Notification of lapse|Effective date: 20211205 |
Priority:
Application number | Application date | Patent title
FR1553478A|FR3035243B1|2015-04-20|2015-04-20|PLACING A CALCULATION TASK ON A FUNCTIONALLY ASYMMETRIC PROCESSOR|
FR1553478|2015-04-20|FR1553478A| FR3035243B1|2015-04-20|2015-04-20|PLACING A CALCULATION TASK ON A FUNCTIONALLY ASYMMETRIC PROCESSOR|
US15/567,067| US20180095751A1|2015-04-20|2016-03-14|Placement of a calculation task on a functionally asymmetric processor|
EP16711185.5A| EP3286647A1|2015-04-20|2016-03-14|Placement of a calculation task on a functionally asymmetric processor|
PCT/EP2016/055401| WO2016173766A1|2015-04-20|2016-03-14|Placement of a calculation task on a functionally asymmetric processor|